Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer
Authors
Abstract
In this paper, we present a method to take visual information into account during the selection process of an acoustic-visual synthesizer. The acoustic-visual speech synthesizer is based on the selection and concatenation of synchronous bimodal diphone units, i.e., the speech signal together with the 3D facial movements of the speaker's face. The visual speech information is acquired using a stereovision technique. Unit selection for synthesis is based on the classical target cost consisting of linguistic and phonological features. We compare several methods for incorporating the visual articulatory context into the target cost. We present an objective evaluation of the synthesis results based on the correlation between the actual and synthesized visual speech trajectories.
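To make the selection criterion concrete, the sketch below combines a classical feature-mismatch target cost with a visual articulatory-context term and evaluates a synthesized trajectory against the recorded one by correlation. The feature names, the weighting scheme, and the Euclidean visual distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical weights; the actual features and weights are corpus-dependent.
LINGUISTIC_WEIGHTS = {"phoneme_context": 1.0, "stress": 0.5, "syllable_position": 0.5}

def linguistic_target_cost(target_feats, candidate_feats, weights=LINGUISTIC_WEIGHTS):
    """Classical target cost: weighted sum of linguistic/phonological feature mismatches."""
    return sum(w * (target_feats[f] != candidate_feats[f]) for f, w in weights.items())

def visual_context_cost(target_visual_ctx, candidate_visual_ctx):
    """Illustrative visual term: Euclidean distance between articulatory context
    vectors (e.g. lip parameters of the neighbouring units)."""
    return float(np.linalg.norm(np.asarray(target_visual_ctx) - np.asarray(candidate_visual_ctx)))

def target_cost(target_unit, candidate_unit, visual_weight=1.0):
    """Combined cost; visual_weight balancing the two terms is an assumed scalar."""
    return (linguistic_target_cost(target_unit["feats"], candidate_unit["feats"])
            + visual_weight * visual_context_cost(target_unit["visual_ctx"],
                                                  candidate_unit["visual_ctx"]))

def trajectory_correlation(actual, synthesized):
    """Objective evaluation: Pearson correlation per visual parameter, averaged."""
    actual, synthesized = np.asarray(actual), np.asarray(synthesized)
    corrs = [np.corrcoef(actual[:, k], synthesized[:, k])[0, 1]
             for k in range(actual.shape[1])]
    return float(np.mean(corrs))
```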
Similar articles
Automatic feature selection for acoustic-visual concatenative speech synthesis: towards a perceptual objective measure
We present an iterative algorithm for automatic feature selection and weight tuning of the target cost in the context of unit-selection-based audio-visual speech synthesis. We perform feature selection and weight tuning for a given unit-selection corpus to make the ranking given by the target cost function consistent with the ordering given by an objective dissimilarity measure. We explicitly perfo...
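A minimal sketch of the kind of weight tuning described above: target-cost weights are adjusted by a greedy coordinate search so that the cost-based ranking of candidate units agrees, in rank correlation, with an objective dissimilarity measure. The ranking criterion and the search strategy are assumptions for illustration, not the algorithm of the cited paper.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_consistency(weights, feature_mismatches, objective_dissimilarity):
    """Spearman correlation between weighted target costs and the objective measure.
    feature_mismatches: (n_candidates, n_features) 0/1 mismatch matrix.
    objective_dissimilarity: (n_candidates,) e.g. distance to a reference visual track."""
    costs = feature_mismatches @ weights
    rho, _ = spearmanr(costs, objective_dissimilarity)
    return rho

def tune_weights(feature_mismatches, objective_dissimilarity, n_iters=50, step=0.1):
    """Greedy coordinate search (an illustrative stand-in for the iterative algorithm)."""
    weights = np.ones(feature_mismatches.shape[1])
    best = rank_consistency(weights, feature_mismatches, objective_dissimilarity)
    for _ in range(n_iters):
        for j in range(len(weights)):
            for delta in (+step, -step):
                trial = weights.copy()
                trial[j] = max(0.0, trial[j] + delta)
                score = rank_consistency(trial, feature_mismatches, objective_dissimilarity)
                if score > best:
                    weights, best = trial, score
    return weights, best
```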
Towards a true acoustic-visual speech synthesis
This paper presents an initial bimodal acoustic-visual synthesis system able to generate concurrently the speech signal and a 3D animation of the speaker’s face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence i...
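A sketch of what a synchronous bimodal unit could look like as a data structure: acoustic samples and 3D facial frames are stored together so that selection and concatenation always move both streams as one, which is what avoids audio-visual asynchrony. The field names, rates, and the concatenation helper are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BimodalDiphoneUnit:
    """One diphone unit carrying both modalities, kept time-aligned."""
    label: str                 # e.g. "a-b"
    audio: np.ndarray          # raw speech samples for the unit
    visual: np.ndarray         # (n_frames, n_points, 3) 3D facial marker trajectories
    audio_rate: int = 16000    # assumed audio sampling rate
    visual_rate: int = 100     # assumed visual frame rate

def concatenate(units):
    """Naive concatenation: both streams are joined jointly, so the speech signal and
    the facial animation stay synchronous (no smoothing or join cost shown here)."""
    audio = np.concatenate([u.audio for u in units])
    visual = np.concatenate([u.visual for u in units], axis=0)
    return audio, visual
```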
A text-to-audiovisual-speech synthesizer for French
An audiovisual speech synthesizer for unlimited French text is presented here. It uses a 3-D parametric model of the face. The facial model is controlled by eight parameters. Target values have been assigned to the parameters, for each French viseme, based upon measurements made on a human speaker. Parameter trajectories are modeled by means of dominance functions associated with each paramete...
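Dominance-function coarticulation is commonly realized in the style of the Cohen-Massaro model: each viseme's target value for a facial parameter is blended over time by a decaying dominance weight. The negative-exponential form and the constants below are illustrative assumptions, not the cited synthesizer's exact settings.

```python
import numpy as np

def dominance(t, center, alpha=1.0, theta=10.0, c=1.0):
    """Dominance of one viseme at time t (seconds), peaking at its center.
    alpha: magnitude, theta: decay rate, c: shape exponent (placeholder values)."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def parameter_trajectory(t, segments):
    """Blend per-viseme targets into one facial-parameter trajectory.
    segments: list of (target_value, center_time) for a single parameter."""
    t = np.asarray(t, dtype=float)
    num = np.zeros_like(t)
    den = np.zeros_like(t)
    for target, center in segments:
        d = dominance(t, center)
        num += d * target
        den += d
    return num / np.maximum(den, 1e-9)
```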
Google's Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders
A neural network model that significantly improves unit-selection-based text-to-speech synthesis is presented. The model employs a sequence-to-sequence LSTM-based autoencoder that compresses the acoustic and linguistic features of each unit to a fixed-size vector referred to as an embedding. Unit selection is facilitated by formulating the target cost as an L2 distance in the embedding space. In o...
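Casting the target cost as an L2 distance between fixed-size unit embeddings can be sketched as follows; the embeddings are assumed to be precomputed by some encoder, and the preselection size is an arbitrary choice.

```python
import numpy as np

def l2_target_cost(target_embedding, candidate_embedding):
    """Target cost as Euclidean distance in the learned embedding space."""
    return float(np.linalg.norm(np.asarray(target_embedding) - np.asarray(candidate_embedding)))

def preselect(target_embedding, candidate_embeddings, k=50):
    """Indices of the k candidates closest to the target embedding; keeping only
    these keeps the subsequent lattice search cheap enough for real-time use."""
    d = np.linalg.norm(candidate_embeddings - target_embedding, axis=1)
    return np.argsort(d)[:k]
```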
Audio-Visual Unit Selection for the Synthesis of Photo-Realistic Talking-Heads
This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mo...
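Optimal selection and concatenation of variable-length units is typically a dynamic-programming (Viterbi) search over a candidate lattice that minimizes combined target and concatenation costs; the cost functions in this sketch are placeholders for whatever a given system actually uses.

```python
def viterbi_unit_selection(lattice, target_cost, concat_cost):
    """lattice: list over time steps; each step is a list of candidate units.
    Returns the sequence of candidate indices minimizing total cost."""
    n_steps = len(lattice)
    best = [dict() for _ in range(n_steps)]   # best[t][j]: cheapest cost ending in candidate j
    back = [dict() for _ in range(n_steps)]   # back[t][j]: predecessor index at step t-1
    for j, cand in enumerate(lattice[0]):
        best[0][j] = target_cost(0, cand)
    for t in range(1, n_steps):
        for j, cand in enumerate(lattice[t]):
            scores = {i: best[t - 1][i] + concat_cost(prev, cand)
                      for i, prev in enumerate(lattice[t - 1])}
            i_best = min(scores, key=scores.get)
            best[t][j] = scores[i_best] + target_cost(t, cand)
            back[t][j] = i_best
    # Backtrack from the cheapest final candidate.
    j = min(best[-1], key=best[-1].get)
    path = [j]
    for t in range(n_steps - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return path[::-1]
```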
Publication date: 2011